Character Stream Parsing of Mixed-lingual Text
نویسندگان
چکیده
In multilingual countries text-to-speech synthesis systems often have to deal with sentences containing inclusions of multiple other languages in form of phrases, words or even parts of words. Such sentences can only be correctly processed using a system that incorporates a mixed-lingual morphological and syntactic analyzer. A prerequisite for such an analyzer is the correct identification of word and sentence boundaries. Traditional text analysis applies to both problems simple heuristic methods within a text preprocessing step. These methods, however, are not reliable enough for analyzing mixed-lingual sentences. This paper presents a new approach towards word and sentence boundary identification for mixed-lingual sentences that bases upon parsing of character streams. Additionally this approach can also be used for word identification in languages without a designated word boundary symbol like Chinese or Japanese. To date, this mixed-lingual text analysis supports any mixture of English, French, German, Italian and Spanish.
منابع مشابه
Mixed-lingual text analysis for polyglot TTS synthesis
Text-to-speech (TTS) synthesis is more and more confronted with the language mixing phenomenon. An important step towards the solution of this problem and thus towards a socalled polyglot TTS system is an analysis component for mixedlingual texts. In this paper it is shown how such an analyzer can be realized for a set of languages, starting from a corresponding set of monolingual analyzers whi...
متن کاملOne Tree is not Enough: Cross-lingual Accumulative Structure Transfer for Semantic Indeterminacy
We address the task of parsing semantically indeterminate expressions, for which several correct structures exist that do not lead to differences in meaning. We present a novel non-deterministic structure transfer method that accumulates all structural information based on cross-lingual word distance derived from parallel corpora. Our system’s output is a ranked list of trees. To evaluate our s...
متن کاملCross-lingual Transfer for Unsupervised Dependency Parsing Without Parallel Data
Cross-lingual transfer has been shown to produce good results for dependency parsing of resource-poor languages. Although this avoids the need for a target language treebank, most approaches have still used large parallel corpora. However, parallel data is scarce for low-resource languages, and we report a new method that does not need parallel data. Our method learns syntactic word embeddings ...
متن کاملA Cross-Lingual Induction Technique for German Adverbial Participles
We provide a detailed comparison of strategies for implementing medium-tolow frequency phenomena such as German adverbial participles in a broadcoverage, rule-based parsing system. We show that allowing for general adverb conversion of participles in the German LFG grammar seriously affects its overall performance, due to increased spurious ambiguity. As a solution, we present a corpus-based cr...
متن کاملTübingen system in VarDial 2017 shared task: experiments with language identification and cross-lingual parsing
This paper describes our systems and results on VarDial 2017 shared tasks. Besides three language/dialect discrimination tasks, we also participated in the cross-lingual dependency parsing (CLP) task using a simple methodology which we also briefly describe in this paper. For all the discrimination tasks, we used linear SVMs with character and word features. The system achieves competitive resu...
متن کامل